Skip to content

[Feat] [history server] Add node endpoint#4436

Open
JiangJiaWei1103 wants to merge 50 commits intoray-project:masterfrom
JiangJiaWei1103:epic-4374/add-node-endpoint
Open

[Feat] [history server] Add node endpoint#4436
JiangJiaWei1103 wants to merge 50 commits intoray-project:masterfrom
JiangJiaWei1103:epic-4374/add-node-endpoint

Conversation

@JiangJiaWei1103
Copy link
Contributor

@JiangJiaWei1103 JiangJiaWei1103 commented Jan 24, 2026

Why are these changes needed?

For full context, please refer to this design_doc.

This PR adds support for the /nodes and /nodes/{node_id} endpoints to the history server.

With these endpoints, the history server can reconstruct node state for terminated (dead) clusters, enabling post-mortem analysis through historical event replay. This includes node summaries and snapshots of node-level resources at different time points.

Change Summary

Screenshot 2026-01-26 at 11 41 27 AM

At a high level, this PR introduces changes across two main layers:

History Server Layer

  • Implement the core logic for the /nodes?view=summary and /nodes/{node_id} endpoints
    • Retrieve the target node map from the event server layer
    • Construct and format node summaries and node-level resource snapshots for API responses

Event Server Layer

  • Define node-related data structures and public interfaces
  • Introduce a reusable state transition interface for merging node state transitions
  • Process NODE_DEFINITION_EVENT and NODE_LIFECYCLE_EVENT
  • Build and maintain a ClusterNodeMap, and expose a GetNodeMap helper for consumption by the history server layer

Test Result

Screenshot 2026-01-27 at 11 58 40 AM

Related issue number

Closes #4376.

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
@@ -0,0 +1,52 @@
package eventserver
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file defines the interface and processing helpers (e.g., MergeStateTransitions) that can be reused across different events' state transitions.

Comment on lines 20 to 31
// injectCollectorRayClusterID injects the ray-cluster-id argument into all collector containers.
func injectCollectorRayClusterID(containers []corev1.Container, rayClusterID string) {
for i := range containers {
if containers[i].Name == "collector" {
containers[i].Command = append(
containers[i].Command,
fmt.Sprintf("--ray-cluster-id=%s", rayClusterID),
)
}
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This patch is proposed by @machichima in this PR.

body, err := io.ReadAll(resp.Body)
gg.Expect(err).NotTo(HaveOccurred())
gg.Expect(resp.StatusCode).To(Equal(200),
gg.Expect(resp.StatusCode).To(Equal(http.StatusOK),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use HTTP status code to preserve semantics without hardcoding the code number.

// isLive indicates whether the response is from a live cluster or a dead cluster:
// - isLive: true for a live cluster (current snapshot)
// - isLive: false for a dead cluster (historical replay)
func verifyNodesRespSchema(test Test, g *WithT, nodesResp map[string]any, isLive bool) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering that different endpoints will have exclusive schema verification, we should extract them to separate files to avoid messing up the main test file.

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Comment on lines 246 to 255
func ConvertBase64ToHex(input string) (string, error) {
bytes, err := base64.StdEncoding.DecodeString(input)
if err != nil {
return "", err
}

hexStr := hex.EncodeToString(bytes)

return hexStr, nil
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This conversion is implemented by @chiayi in this PR.

@JiangJiaWei1103 JiangJiaWei1103 marked this pull request as ready for review January 26, 2026 03:51
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
JiangJiaWei1103 and others added 4 commits January 30, 2026 09:29
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Future-Outlier and others added 3 commits January 30, 2026 23:15
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Future-Outlier and others added 2 commits January 30, 2026 23:43
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

JiangJiaWei1103 and others added 3 commits February 6, 2026 16:46
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Comment on lines +1289 to +1301
// constructResourceString constructs a resource string based on the resources in state transition.
// Ref: https://github.com/ray-project/ray/blob/f953f199b5d68d47c07c865c5ebcd2333d49f365/python/ray/autoscaler/_private/util.py#L643-L665.
func constructResourceString(resources map[string]float64) string {
resourceKeys := make([]string, 0, len(resources))
for k := range resources {
resourceKeys = append(resourceKeys, k)
}
sort.Strings(resourceKeys)

resourceString := ""
for _, k := range resourceKeys {
v := resources[k]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we include other accelerator type here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we write something like this?

if k == "memory" || k == "object_store_memory" {
    formattedUsed := "0B"
    formattedTotal := formatMemory(v)
    resourceString += fmt.Sprintf("%s/%s %s", formattedUsed, formattedTotal, k)
} else if strings.HasPrefix(k, "node:") {
    continue  // Skip per-node resources
} else if strings.HasPrefix(k, "accelerator_type:") {
    continue  // Skip accelerator_type (Issue #33272)
} else {
    // Handle CPU, GPU, TPU, and other resources
    resourceString += fmt.Sprintf("%.1f/%.1f %s", 0.0, v, k)
}

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

resourceString = strings.TrimSuffix(resourceString, "\n")

return resourceString
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resource string omits GPU and custom accelerator resources

Medium Severity

constructResourceString only formats "CPU", "memory", and "object_store_memory" resources, using continue to skip everything else including "GPU" and other accelerator types. For GPU-enabled clusters, the resource string will be missing GPU information entirely. The referenced Python implementation at util.py#L643-L665 likely formats all resource types, not just these three.

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature][history server] Node Endpoint

3 participants